15 research outputs found

    Decomposable families of itemsets

    The problem of selecting a small, yet high-quality subset of patterns from a larger collection of itemsets has recently attracted a lot of research. Here we discuss an approach to this problem using the notion of decomposable families of itemsets. Such itemset families define a probabilistic model for the data from which the original collection of itemsets was derived. Furthermore, they induce a special tree structure, called a junction tree, familiar from the theory of Markov random fields. The method has several advantages. The junction trees provide an intuitive representation of the mining results. From the computational point of view, the model provides leverage for problems that could be intractable using the entire collection of itemsets. We provide an efficient algorithm to build decomposable itemset families, and give an application example with frequency bound querying using the model. An empirical study shows that our algorithm yields high-quality results.
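    As a rough illustration (not the paper's algorithm for building the families), the sketch below evaluates the standard decomposable-model factorization over a junction tree: a product of empirical clique marginals divided by the separator marginals. The clique and separator sets are hypothetical inputs here; constructing them is exactly what the paper's algorithm does.

        # Hypothetical sketch: evaluating a decomposable model given a junction tree
        # over 0-1 data. The cliques and separators are assumed inputs.
        import numpy as np

        def empirical_marginal(data, cols, values):
            # Empirical probability that the columns `cols` take the 0-1 values `values`.
            return np.all(data[:, cols] == values, axis=1).mean()

        def decomposable_prob(data, cliques, separators, x):
            # P(x) = prod_C P_hat(x_C) / prod_S P_hat(x_S) for clique/separator
            # index tuples; a zero separator marginal implies a zero clique term.
            p = 1.0
            for c in cliques:
                p *= empirical_marginal(data, list(c), x[list(c)])
            for s in separators:
                denom = empirical_marginal(data, list(s), x[list(s)])
                p /= denom if denom > 0 else 1.0
            return p

        # Toy usage: three attributes with junction tree {0,1} - {1} - {1,2}.
        rng = np.random.default_rng(0)
        data = (rng.random((1000, 3)) < 0.3).astype(int)
        print(decomposable_prob(data, [(0, 1), (1, 2)], [(1,)], np.array([1, 0, 1])))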

    Extending data mining techniques for frequent pattern discovery: trees, low-entropy sets, and crossmining

    The idea of frequent pattern discovery is to find frequently occurring events in large databases. Such data mining techniques can be useful in various domains. For instance, in recommendation and e-commerce systems frequently occurring product purchase combinations are essential in user preference modeling. In the ecological domain, patterns of frequently occurring groups of species can be used to gain insight into species interaction dynamics. Over the past few years, most frequent pattern mining research has concentrated on the efficiency (speed) of mining algorithms. However, it has been argued within the community that while the efficiency of the mining task is no longer a bottleneck, there is still an urgent need for methods that derive compact, yet high-quality results with good application properties. The aim of this thesis is to address this need. The first part of the thesis discusses a new type of tree pattern class for expressing hierarchies of general and more specific attributes in unstructured binary data. The new pattern class is shown to have advantageous properties, and to discover relationships in data that cannot be expressed with the more traditional frequent itemset or association rule patterns alone. The second and third parts of the thesis discuss the use of entropy as a score measure for frequent pattern mining. A new pattern class is defined, low-entropy sets, which allow expressing more general types of occurrence structure than frequent itemsets do. The concept can also be easily applied to tree-shaped patterns. Furthermore, by applying the minimum description length principle to pattern selection for low-entropy sets, it is shown experimentally that in most cases the collections of selected patterns are much smaller than those obtained with frequent itemsets. The fourth part of the thesis examines the idea of crossmining itemsets, that is, relating itemsets to numerical variables in a database of mixed data types. The problem is formally defined and turns out to be NP-hard, although it can be approximated within a constant factor of the optimal solution. Experiments show that the algorithm finds itemsets that convey structure in both the binary and the numerical part of the data.
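    To make the low-entropy-set idea concrete, the toy example below (a generic illustration, not the thesis' exact measures or MDL encoding) computes the entropy of a two-attribute projection that is almost always all-zero: the set has very low entropy, yet the support of its all-ones pattern is far too small for frequent-itemset mining to report it.

        # Generic illustration: entropy of a projection vs. support of the all-ones pattern.
        import numpy as np
        from collections import Counter

        def projection_entropy(data, cols):
            # Empirical entropy (in bits) of the data projected on the columns `cols`.
            counts = Counter(map(tuple, data[:, cols]))
            p = np.array(list(counts.values())) / len(data)
            return float(-(p * np.log2(p)).sum())

        # Two attributes that are almost always both 0: a low-entropy set, but the
        # itemset {0, 1} has support 0.05, so frequent-itemset mining would miss it.
        data = np.array([[0, 0]] * 95 + [[1, 1]] * 5)
        print(projection_entropy(data, [0, 1]))      # about 0.29 bits
        print(np.all(data == 1, axis=1).mean())      # support of the all-ones pattern: 0.05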

    Fast text-only domain adaptation of RNN-transducer prediction network

    Adaptation of end-to-end speech recognition systems to new tasks is known to be challenging. A number of solutions have been proposed which apply external language models with various fusion methods, possibly in combination with two-pass decoding. TTS systems have also been used to generate adaptation data for the end-to-end models. In this paper we show that RNN-transducer models can be effectively adapted to new domains using only small amounts of textual data. By taking advantage of the model's inherent structure, where the prediction network is interpreted as a language model, we can apply fast adaptation to the model. Adapting the model avoids the need for complicated decoding-time fusions and external language models. Using appropriate regularization, the prediction network can be adapted to new domains while still retaining good generalization capabilities. We show with multiple ASR evaluation tasks how this method can provide relative gains of 10-45% in target task WER. We also share insights into how the RNN-transducer prediction network performs as a language model.
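    A hedged sketch of the general idea, not the paper's exact recipe: the prediction network is treated as a language model and fine-tuned on target-domain token sequences, with an L2 pull toward the seed weights standing in for whatever regularization the authors use. The PredictionNetwork module, the lm_head, and all hyperparameters are illustrative stand-ins.

        # Hedged sketch: text-only fine-tuning of a stand-in prediction network as an LM,
        # with an (assumed) L2 penalty toward the seed weights as the regularizer.
        import torch
        import torch.nn as nn

        class PredictionNetwork(nn.Module):
            # Stand-in for an RNN-T prediction network: embedding + LSTM, plus an
            # LM head used only for the text-only adaptation objective.
            def __init__(self, vocab, dim=256):
                super().__init__()
                self.embed = nn.Embedding(vocab, dim)
                self.rnn = nn.LSTM(dim, dim, batch_first=True)
                self.lm_head = nn.Linear(dim, vocab)

            def forward(self, tokens):
                h, _ = self.rnn(self.embed(tokens))
                return self.lm_head(h)

        def adapt_on_text(model, batches, lr=1e-4, l2_to_seed=1e-3, steps=100):
            # `batches` is assumed to be an iterable of (batch, time) LongTensors of token ids.
            seed = {n: p.detach().clone() for n, p in model.named_parameters()}
            opt = torch.optim.Adam(model.parameters(), lr=lr)
            loss_fn = nn.CrossEntropyLoss()
            for _, tokens in zip(range(steps), batches):
                logits = model(tokens[:, :-1])        # next-token prediction
                ce = loss_fn(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
                reg = sum(((p - seed[n]) ** 2).sum() for n, p in model.named_parameters())
                (ce + l2_to_seed * reg).backward()
                opt.step()
                opt.zero_grad()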

    Inferring hierarchical classifications from 0-1 data

    Hierarchical classification is the sorting of a set of objects into hierarchically organized categories and their subcategories. It is a widely used technique for modeling both scientific and commercial data; application areas include systems biology, medicine, customer data analysis, and artificial intelligence. This Master's thesis addresses the inference of hierarchical classifications and related questions. The work is structured in two parts. The first part reviews methods for inferring hierarchical classifications and the associated theory. In particular, the thesis focuses on selected methods for constructing hierarchical classifications and on methods for comparing classifications with one another; defining dissimilarity measures between data objects is another important theme. In the second part, the methods and definitions are applied to a database of European mammal occurrences. The practical research question is whether mammal occurrence supports a hierarchical classification model. The thesis analyzes the behavior of distribution-based distance measures between the mammal species and, based on this, constructs a set of hierarchical classifications using both agglomerative clustering and a greedy search strategy over hierarchy trees. The fit of the hierarchical classification model to the mammal data is assessed using Monte Carlo and bootstrap methods, both of which are based on resampling the original data. The results yield a plausible hierarchical classification of the mammals in the data, based on geographical division.
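    As a rough illustration of the second part (the thesis' exact distance measures and resampling setup are not detailed in this abstract), the sketch below clusters species from a 0-1 site-by-species matrix using average-linkage agglomerative clustering on Jaccard distances, and bootstrap-resamples the sites as a crude stability check.

        # Hedged sketch: Jaccard distances and average-linkage clustering as stand-ins
        # for the thesis' distance measures and tree-construction methods.
        import numpy as np
        from scipy.spatial.distance import pdist
        from scipy.cluster.hierarchy import linkage, fcluster

        def cluster_species(presence):
            # Rows are sites, columns are species; cluster the species.
            d = pdist(presence.T.astype(bool), metric="jaccard")
            return linkage(d, method="average")

        def bootstrap_labels(presence, k=4, reps=100, seed=0):
            # Re-cluster on bootstrap resamples of the sites to probe group stability.
            rng = np.random.default_rng(seed)
            labels = []
            for _ in range(reps):
                rows = rng.integers(0, len(presence), size=len(presence))
                labels.append(fcluster(cluster_species(presence[rows]), k, criterion="maxclust"))
            return np.array(labels)

        presence = (np.random.default_rng(1).random((200, 30)) < 0.2).astype(int)
        print(bootstrap_labels(presence).shape)       # (reps, n_species) cluster labels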

    The Crowd-Median Algorithm

    The power of human computation is founded on the capabilities of humans to process qualitative information in a manner that is hard to reproduce with a computer. However, all machine learning algorithms rely on mathematical operations, such as sums, averages, and least squares, that are less suitable for human computation. This paper is an effort to combine these two aspects of data processing. We consider the problem of computing a centroid of a data set, a key component in many data-analysis applications such as clustering, using a very simple human intelligence task (HIT). In this task the workers must choose the outlier from a set of three items. After presenting a number of such triplets to the workers, the item chosen the least number of times as the outlier is selected as the centroid. We provide a proof that the centroid determined by this procedure is equal to the mean of a univariate normal distribution. Furthermore, as a demonstration of the viability of our method, we implement a human-computation-based variant of the k-means clustering algorithm. We present experiments where the proposed method is used to find an "average" image in a collection, and to cluster images into semantic categories.
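    A small simulation of the mechanism described above, with a distance-based stand-in for the human worker: the outlier of a triplet is taken to be the item farthest from the other two, outlier votes are tallied over random triplets, and the least-voted item is returned as the centroid. This illustrates the procedure, not the paper's experimental setup.

        # Simulated triplet-outlier HIT: the "worker" picks the item farthest from
        # the other two; the item picked least often is returned as the centroid.
        import numpy as np

        def simulated_outlier(items, triplet):
            a, b, c = triplet
            d = lambda i, j: np.linalg.norm(items[i] - items[j])
            scores = {a: d(a, b) + d(a, c), b: d(b, a) + d(b, c), c: d(c, a) + d(c, b)}
            return max(scores, key=scores.get)

        def crowd_median(items, n_triplets=2000, seed=0):
            rng = np.random.default_rng(seed)
            votes = np.zeros(len(items), dtype=int)
            for _ in range(n_triplets):
                votes[simulated_outlier(items, rng.choice(len(items), 3, replace=False))] += 1
            return int(np.argmin(votes))              # least-often-outlier item

        points = np.random.default_rng(1).normal(size=(200, 1))
        print(points[crowd_median(points)])           # typically close to 0, the true mean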

    Finding Low-Entropy Sets and Trees from Binary Data

    The discovery of subsets with special properties from binary data has been one of the key themes in pattern discovery. Pattern classes such as frequent itemsets stress the co-occurrence of the value 1 in the data. While this choice makes sense in the context of sparse binary data, it disregards potentially interesting subsets of attributes that have some other type of dependency structure. We consider the problem of finding all subsets of attributes that have low complexity. The complexity is measured by either the entropy of the projection of the data on the subset, or the entropy of the data for the subset when modeled using a Bayesian tree, with downward- or upward-pointing edges. We show that the entropy measure on sets has a monotonicity property, and thus a levelwise approach can find all low-entropy itemsets. We also show that the tree-based measures are bounded above by the entropy of the corresponding itemset, allowing similar algorithms to be used for finding low-entropy trees. We describe algorithms for finding all subsets satisfying an entropy condition. We give an extensive empirical evaluation of the performance of the methods both on synthetic and on real data. We also discuss the search for high-entropy subsets and the computation of the Vapnik-Chervonenkis dimension of the data.
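    A minimal levelwise search along the lines sketched in the abstract: since the entropy of a projection can only grow (or stay equal) when attributes are added, any set above the threshold can be pruned together with all of its supersets, as in Apriori-style mining. The candidate generation below is deliberately naive, and the tree-based measures are not reproduced.

        # Minimal levelwise search for low-entropy attribute sets; prunes using the
        # monotonicity of projection entropy under adding attributes.
        import numpy as np
        from collections import Counter

        def entropy(data, cols):
            counts = Counter(map(tuple, data[:, list(cols)]))
            p = np.array(list(counts.values())) / len(data)
            return float(-(p * np.log2(p)).sum())

        def low_entropy_sets(data, max_bits):
            frontier = [(j,) for j in range(data.shape[1]) if entropy(data, (j,)) <= max_bits]
            result = list(frontier)
            while frontier:
                # Extend each surviving set with larger attribute indices, keeping only
                # candidates that still satisfy the entropy threshold.
                candidates = {s + (j,) for s in frontier for j in range(s[-1] + 1, data.shape[1])}
                frontier = [c for c in candidates if entropy(data, c) <= max_bits]
                result.extend(frontier)
            return result

        data = (np.random.default_rng(0).random((500, 8)) < 0.1).astype(int)
        print(low_entropy_sets(data, max_bits=0.8))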

    Finding Trees From Unordered 0–1 Data

    Tree structures are a natural way of describing occurrence relationships between attributes in a dataset. We define a new class of tree patterns for unordered 0–1 data and consider the problem of discovering frequently occurring members of this pattern class. Intuitively, a tree T occurs in a row u of the data if the attributes of T that occur in u form a subtree of T containing the root. We show that this definition has advantageous properties: only shallow trees have a significant probability of occurring in random data, and the definition allows a simple levelwise algorithm for mining all frequently occurring trees. We demonstrate with empirical results that the method is feasible and that it discovers interesting trees in real data.
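    The occurrence definition above can be checked in a few lines; the sketch below is a literal reading of it (the attributes of T that are present in the row must form a connected subtree containing the root), not the paper's mining algorithm.

        def occurs(parent, root, row):
            # parent: maps each non-root attribute of T to its parent attribute in T;
            # root: the root attribute of T; row: mapping from attribute to its 0-1 value.
            present = {a for a in list(parent) + [root] if row[a] == 1}
            if root not in present:
                return False
            # Connectedness: every present non-root attribute must have a present parent.
            return all(parent[a] in present for a in present if a != root)

        # Tree T: root 0 with children 1 and 2; attribute 3 is a child of 1.
        parent = {1: 0, 2: 0, 3: 1}
        print(occurs(parent, 0, {0: 1, 1: 1, 2: 0, 3: 0}))   # True: {0, 1} is a rooted subtree
        print(occurs(parent, 0, {0: 1, 1: 0, 2: 0, 3: 1}))   # False: 3 present, parent 1 absent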

    Cross-mining Binary and Numerical Attributes

    We consider the problem of relating itemsets mined on binary attributes of a data set to numerical attributes of the same data. An example is biogeographical data, where the numerical attributes correspond to environmental variables and the binary attributes encode the presence or absence of species in different environments. From the viewpoint of itemset mining, the task is to select a small collection of interesting itemsets using the numerical attributes; from the viewpoint of the numerical attributes, the task is to constrain the search for local patterns (e.g. clusters) using the binary attributes. We give a formal definition of the problem, discuss it theoretically, give a simple constant-factor approximation algorithm, and show by experiments on biogeographical data that the algorithm can capture interesting patterns that would not have been found using either itemset mining or clustering alone.
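    The sketch below is only a naive illustration of the cross-mining idea, not the paper's formulation or its constant-factor algorithm: each candidate itemset is scored by how tightly the numerical attributes cluster on the rows containing it (average squared deviation from their mean, an assumed cost), and the k best-scoring itemsets are kept.

        # Naive illustration with an assumed cost: keep the k itemsets whose covering
        # rows have the smallest spread in the numerical attributes.
        import numpy as np

        def rows_containing(binary, itemset):
            return np.all(binary[:, list(itemset)] == 1, axis=1)

        def within_cost(numeric, mask):
            # Average squared deviation from the mean of the covered rows (assumed score;
            # lower means the numerical attributes are tighter on the itemset's rows).
            if not mask.any():
                return np.inf
            covered = numeric[mask]
            return float(((covered - covered.mean(axis=0)) ** 2).mean())

        def crossmine(binary, numeric, candidates, k=3):
            return sorted(candidates, key=lambda s: within_cost(numeric, rows_containing(binary, s)))[:k]

        rng = np.random.default_rng(0)
        binary = (rng.random((300, 6)) < 0.4).astype(int)
        numeric = rng.normal(size=(300, 2)) + 3 * binary[:, [0]]  # item 0 shifts the numerical data
        print(crossmine(binary, numeric, [(0,), (1,), (2,), (0, 1), (3, 4)], k=2))
        # Itemsets containing attribute 0 should score best, since their rows share the shift.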